Full discrimination of subtopics in search results with keyphrase-based clustering

Authors

  • Claudio Carpineto
  • Massimiliano D'Amico
  • Andrea Bernardini
Abstract

We consider the problem of retrieving multiple documents relevant to the individual subtopics of a given web query, termed “full-subtopic retrieval”. To solve this problem we present a novel search results clustering algorithm that generates clusters labeled by keyphrases. The keyphrases are extracted from the generalized suffix tree built from the search results and merged through an improved hierarchical agglomerative clustering procedure. Our approach has been implemented in KeySRC (Keyphrase-based Search Results Clustering), a full web clustering engine available online at http://keysrc.fub.it. We discuss how the keyphrase-based clustering algorithm can be used not only for browsing through the clustered search results but also for producing a re-ranked list of results emphasizing the diversity of top hits. Using a novel measure for evaluating full-subtopic retrieval performance, called “Subtopic Search Length under k document sufficiency”, and a test collection specifically designed for evaluating subtopic retrieval, we found that our approach was able to discriminate between the different subtopics present in search results in a very effective manner, with a clear improvement over other subtopic retrieval systems. In particular, browsing through KeySRC clusters was the best method to retrieve more documents per subtopic (i.e., k > 1), whereas using the re-ranked list formed from KeySRC clusters was more effective for retrieving just one document per subtopic (i.e., k = 1).
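
The idea of turning clusters into a diversity-oriented ranked list, and of scoring a list by how far a user must read to find k documents per subtopic, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not say how KeySRC builds its re-ranked list, so a plain round-robin interleaving is swapped in, and `search_length_at_k` is a simplified stand-in, not the paper's definition of “Subtopic Search Length under k document sufficiency”. The function names, document ids, and subtopic labels are hypothetical.

```python
from itertools import zip_longest


def round_robin_rerank(clusters):
    """Interleave clusters: take one document from each cluster in turn.

    Assumption: round-robin is only one simple way to diversify top hits;
    it is not claimed to be the KeySRC re-ranking procedure.
    """
    seen, ranked = set(), []
    for tier in zip_longest(*clusters):  # pads shorter clusters with None
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                ranked.append(doc)
    return ranked


def search_length_at_k(ranked, relevant_by_subtopic, k):
    """Average rank at which the k-th relevant document of each subtopic appears.

    Simplified illustration only. A subtopic with fewer than k relevant
    documents in the list is charged len(ranked) + 1.
    """
    lengths = []
    for relevant in relevant_by_subtopic.values():
        found = 0
        length = len(ranked) + 1
        for position, doc in enumerate(ranked, start=1):
            if doc in relevant:
                found += 1
                if found == k:
                    length = position
                    break
        lengths.append(length)
    return sum(lengths) / len(lengths)


# Hypothetical toy data: three clusters, one per subtopic.
clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6"]]
relevant_by_subtopic = {"A": {"d1", "d2"}, "B": {"d4", "d5"}, "C": {"d6"}}

reranked = round_robin_rerank(clusters)
flat = [doc for cluster in clusters for doc in cluster]

print(reranked)
print(search_length_at_k(reranked, relevant_by_subtopic, k=1))
print(search_length_at_k(flat, relevant_by_subtopic, k=1))
```

On this toy data the interleaved list reaches one relevant document per subtopic sooner than the cluster-by-cluster list. This only illustrates the k = 1 case described in the abstract and says nothing about the reported experimental results.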

Similar resources

Evaluating subtopic retrieval methods: Clustering versus diversification of search results

To address the inability of current ranking systems to support subtopic retrieval, two main post-processing techniques of search results have been investigated: clustering and diversification. In this paper we present a comparative study of their performance, using a set of complementary evaluation measures that can be applied to both partitions and ranked lists, and two specialized test collec...

An improved opposition-based Crow Search Algorithm for Data Clustering

Data clustering is an ideal way of working with a huge amount of data and looking for structure in the dataset. In other words, clustering groups similar data: the similarity among the data within a cluster is maximal, and the similarity among the data in different clusters is minimal. The innovation of this paper is a clustering method based on the Crow Search Algorithm (CS...

Entity Tracking in Real-Time Using Sub-topic Detection on Twitter

The velocity, volume and variety with which Twitter generates text is increasing exponentially. It is critical to determine latent sub-topics from such tweet data at any given point of time for providing better topic-wise search results relevant to users’ informational needs. The two main challenges in mining subtopics from tweets in real-time are (1) understanding the semantic and the conceptu...

A new text-mining method for extracting user context information to improve the ranking of search engine results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

Evaluation of methods for grouping rapeseed genotypes using Fisher's linear discriminant function analysis

Discriminant function analysis is a method of multivariate analysis that can be used to assess the validity of a cluster analysis. In this study, Fisher’s linear discriminant function analysis was used to evaluate the results from different methods of cluster analysis (i.e. different distance criteria, different clustering procedures, standardized and unstandardized data). Furthermore, H...

Journal:
  • Web Intelligence and Agent Systems

Volume 9

Publication date: 2011